title: “R Notebook” output: html_notebook —
Explore and Summarize white wine product by Rahul Kumar
My data set consists of 4,898 white wines with 11 input variables and one output variable.
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Other:
13 - id (unique ID for each sample, needed for submission)
#library(ggplot2)
#install.packages('knitr',dependencies = T)
#install.packages("lmtest", repos = "http://cran.us.r-project.org")
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(knitr)
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:memisc':
##
## collect, recode, rename
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
list.files()
## [1] "explore_and_summarize_data_files"
## [2] "explore_and_summarize_data.html"
## [3] "explore_and_summarize_data.nb.html"
## [4] "explore_and_summarize_data.Rmd"
## [5] "wineQualityWhites.csv"
pf=read.csv('wineQualityWhites.csv')
names(pf)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
head(pf)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
Summary
summary(pf)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
PLOT EACH VARIABLE
names(pf)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
ggplot(aes(x=fixed.acidity,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
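Fixed acidity ranges from a minimum of 3.8 to a maximum of 14.2. Each bar is the sum of the quality scores of all wines at that fixed acidity, so the bars rise up to about the median (6.8) and fall beyond it mainly because most wines lie near the median. The plot therefore shows where wines are concentrated rather than how quality changes with fixed acidity.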
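With `stat="identity"` and `quality` mapped to `y`, the bars are stacked at each x value, so a bar's height is the sum of the quality scores of all wines sharing that value. The base-R sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the wine file) to split those heights into a count and a mean:

```r
# Hypothetical toy data: six wines, three distinct fixed acidity values
toy <- data.frame(
  fixed.acidity = c(6.3, 6.8, 6.8, 6.8, 7.3, 7.3),
  quality       = c(7,   5,   6,   5,   6,   6)
)

# Height of each stacked bar: sum of quality per fixed acidity value
bar_height <- tapply(toy$quality, toy$fixed.acidity, sum)

# The same heights decomposed into number of wines and mean quality
n_wines      <- tapply(toy$quality, toy$fixed.acidity, length)
mean_quality <- tapply(toy$quality, toy$fixed.acidity, mean)

print(rbind(bar_height, n_wines, mean_quality))
```

In this toy case the tallest bar belongs to the value with the most wines, even though its mean quality is the lowest, which is why a per-value mean (or a boxplot) is a safer basis for statements about quality.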
ggplot(aes(x=volatile.acidity,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile acidity ranges from 0.08 to 1.10. The summed quality in the plot rises up to about the median value (0.26) and then falls as volatile acidity increases further.
ggplot(aes(x=citric.acid,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid ranges from 0.00 to 1.66. The summed quality in the plot rises up to about the median value (0.32) and then falls as citric acid increases further.
ggplot(aes(x=residual.sugar,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual sugar ranges from 0.6 to 65.8. In the plot, the summed quality appears highest at a residual sugar of approximately 5.2 (the median) and falls as residual sugar increases beyond that.
ggplot(aes(x=chlorides,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides range from 0.009 to 0.346. The summed quality in the plot rises as chlorides increase up to the median (0.043) and falls as chlorides increase beyond it.
ggplot(aes(x=free.sulfur.dioxide,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide ranges from 2.00 to 289.00. The summed quality in the plot rises as free sulfur dioxide increases up to the median (34) and falls as free sulfur dioxide increases beyond it.
ggplot(aes(x=total.sulfur.dioxide,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide ranges from 9.00 to 440.00. The summed quality in the plot rises as total sulfur dioxide increases up to the median (134) and falls as total sulfur dioxide increases beyond it.
ggplot(aes(x=density,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density ranges from 0.9871 to 1.039. The summed quality in the plot rises as density increases up to the median (0.9937) and falls as density increases beyond it.
ggplot(aes(x=pH,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH ranges from 2.72 to 3.82. The summed quality in the plot rises as pH increases up to the median (3.18) and falls as pH increases beyond it.
ggplot(aes(x=sulphates,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates range from 0.22 to 1.08. The summed quality in the plot rises as sulphates increase up to the median (0.47) and falls as sulphates increase beyond it.
ggplot(aes(x=alcohol,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
summary(pf$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol ranges from 8.0 to 14.2. The summed quality in the plot rises as alcohol increases up to the median (10.40) and falls as alcohol increases beyond it.
summary(pf$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
names(pf)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
pf$alcohol_percentage<-cut(pf$alcohol,c(8,10,12,14,16))
head(pf$alcohol_percentage)
## [1] (8,10] (8,10] (10,12] (8,10] (8,10] (10,12]
## Levels: (8,10] (10,12] (12,14] (14,16]
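One detail of `cut()` matters for this grouping: by default each interval is open on the left, and the summary above reports a minimum alcohol of 8.00, which sits exactly on the first break. A minimal sketch with made-up values (the vector `alcohol` here is hypothetical, not the wine column) shows the effect and one possible remedy, `include.lowest = TRUE`:

```r
# Hypothetical values chosen to sit on and around the break points
alcohol <- c(8.0, 8.8, 10.0, 10.1, 14.2)
breaks  <- c(8, 10, 12, 14, 16)

# Default: intervals are (a, b], so 8.0 is outside (8,10] and becomes NA
default_groups <- cut(alcohol, breaks)

# include.lowest = TRUE closes the first interval on the left: [8,10]
closed_groups <- cut(alcohol, breaks, include.lowest = TRUE)

print(default_groups)
print(closed_groups)
```

Any wine with alcohol of exactly 8.0 would therefore get `NA` under the default call and would be dropped by later `!is.na(alcohol_percentage)` filters.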
This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
(worst) ------> (best) Quality: 3, 4, 5, 6, 7, 8, 9. Quality is recorded as an integer score and is treated here as a numeric variable.
Other observations:
The average quality of the wines is 5.878. In the plots above, the summed quality rises as each ingredient increases up to its median value and falls as the ingredient increases beyond its median.
## What is/are the main feature(s) of interest in your dataset?
The main features in the data set are alcohol and quality. I'd like to determine which ingredients are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a model that predicts wine quality.
## What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol likely contribute to the quality of a white wine. After researching information on wine quality, I think alcohol contributes most.
I created a variable for the alcohol percentage group of each wine from the alcohol variable. This arose in the bivariate section of my analysis when I explored how the quality of a wine varied with its alcohol percentage. The grouping was calculated by dividing the alcohol percentage into four groups.
I looked at the distribution of alcohol percentage and its correlation with quality, since it is related to wine quality. I also looked at the distribution of pH and its correlation with quality; pH is correlated with wine quality as well, though more weakly.
head(pf)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality alcohol_percentage
## 1 6 (8,10]
## 2 6 (8,10]
## 3 6 (10,12]
## 4 6 (8,10]
## 5 6 (8,10]
## 6 6 (10,12]
The chemical properties of a white wine tend to correlate with each other and with other variables. Quality correlates with alcohol and, to a lesser degree, with other variables.
set.seed(1000)
ggpairs(pf,
lower = list(continuous=wrap("points",shape=I('.'))),
upper = list(combo=wrap("box",outlier.shape=I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cor.test(pf$alcohol,pf$quality)
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
cor.test(pf$alcohol,pf$pH)
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09374446 0.14893205
## sample estimates:
## cor
## 0.1214321
cor.test(pf$pH,pf$quality)
##
## Pearson's product-moment correlation
##
## data: pf$pH and pf$quality
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07162022 0.12707983
## sample estimates:
## cor
## 0.09942725
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),inverse = function(x) x^3)
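A quick base-R check of the two functions handed to `trans_new()` can confirm that they undo each other and show how the transform compresses large values. This is only a sketch; the helper names `cuberoot` and `cube` and the test values are chosen here for illustration:

```r
# Same forward and inverse functions as passed to trans_new()
cuberoot <- function(x) x^(1/3)
cube     <- function(x) x^3

# Round trip: inverse(transform(x)) should give x back
x <- c(0, 2, 6, 10, 100)
round_trip <- cube(cuberoot(x))
print(round_trip)

# Equal steps in the data become shrinking steps after the transform
steps <- diff(cuberoot(c(0, 20, 40, 60, 80)))
print(steps)
```

Note that `x^(1/3)` returns `NaN` for negative numbers in R, which does not matter here because the variables being plotted are non-negative.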
From a subset of the data, fixed acidity and total sulphur dioxide do not seem to have strong correlations with quality, while alcohol is moderately correlated with quality (r ≈ 0.44) and pH only weakly (r ≈ 0.10). I want to look closer at scatter plots involving quality and some other variables such as fixed acidity, alcohol, and pH.
ggplot(aes(fixed.acidity,quality),data=pf)+
geom_point()+
scale_x_continuous(trans = cuberoot_trans(),limits = c(6,14),
breaks = c(6,8,10,12,14))+
scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality (cube root) by cube root of fixed acidity')
## Warning: Removed 575 rows containing missing values (geom_point).
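The "Removed ... rows" warning comes from the `limits` arguments: with the default out-of-bounds handling of a continuous scale, values outside the limits are replaced by `NA` and then dropped by `geom_point()`. A minimal sketch of that count, using a made-up vector (`fixed_acidity` below is hypothetical, not the real column):

```r
# Hypothetical fixed acidity values; the real column has 4898 entries
fixed_acidity <- c(3.8, 5.9, 6.0, 6.8, 7.3, 14.0, 14.2)
limits <- c(6, 14)

# Values strictly outside the limits are the ones the plot would drop
outside <- fixed_acidity < limits[1] | fixed_acidity > limits[2]
print(sum(outside))

# On the real data the analogous count would be
# sum(pf$fixed.acidity < 6 | pf$fixed.acidity > 14)
```

Since the median fixed acidity is 6.8 and the first quartile is 6.3, a lower limit of 6 cuts into the bulk of the data, so widening the limits (for example to the observed range) would keep those wines in the plot.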
ggplot(aes(free.sulfur.dioxide,quality),data=pf)+
geom_point()+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality (cube root) by cube root of free sulphur dioxide')
## Warning: Removed 17 rows containing missing values (geom_point).
As the quantity of free sulphur dioxide increases, the variance in quality increases. Quality appears to rise up to about the median value of free sulphur dioxide and to decrease after that.
cor.test(pf$free.sulfur.dioxide,pf$quality)
##
## Pearson's product-moment correlation
##
## data: pf$free.sulfur.dioxide and pf$quality
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01985292 0.03615626
## sample estimates:
## cor
## 0.008158067
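The t statistic and p-value in this output follow directly from the correlation and the sample size, via t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below recomputes them from the printed numbers only, without needing the data file:

```r
r  <- 0.008158067   # correlation reported by cor.test() above
n  <- 4898          # number of wines in the data set
df <- n - 2         # degrees of freedom, 4896 as in the output

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, p = p_value))
```

With a p-value of about 0.57 and a confidence interval that includes zero, this output gives no evidence of a linear relationship between free sulfur dioxide and quality.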
ggplot(aes(x=alcohol_percentage,y=quality),data=pf)+
geom_boxplot()
summary(pf$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The median alcohol content is 10.40. This seems unusual, since I would expect wines with an ideal alcohol percentage to have a higher quality than the other groups. There are many outliers. The variation in quality tends to increase as alcohol percentage increases up to the median value and then decreases as alcohol percentage rises above it.
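The quantities a boxplot draws can also be tabulated directly, which makes statements about group medians and spread easier to check. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical; only the column names follow `pf`):

```r
# Hypothetical toy data with the same column names as pf
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.4, 12.5, 13.0),
  quality = c(5,   6,   6,    6,    7,    8)
)
toy$alcohol_percentage <- cut(toy$alcohol, c(8, 10, 12, 14, 16))

# Median and interquartile range of quality within each alcohol group
group_median <- tapply(toy$quality, toy$alcohol_percentage, median)
group_iqr    <- tapply(toy$quality, toy$alcohol_percentage, IQR)

print(rbind(group_median, group_iqr))
```

Groups with no wines show up as `NA`. On the real data, replacing `toy` with `pf` would give the per-group medians that the boxplot displays as the centre lines.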
summary(pf$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pf$pH_group<-cut(pf$pH,c(2.720,3.02,3.32,3.62,3.82))
head(pf$pH_group)
## [1] (2.72,3.02] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32]
## Levels: (2.72,3.02] (3.02,3.32] (3.32,3.62] (3.62,3.82]
ggplot(aes(x=pH_group,y=quality),data = pf)+
geom_boxplot()
summary(pf$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The median pH is 3.18. This seems unusual, since I would expect wines with an ideal pH to have a higher quality than the other groups. There are many outliers. The variation in quality tends to increase as pH increases up to the median value and then decreases as pH rises above it.
# Bivariate Analysis
## Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Quality correlates moderately with alcohol percentage (r ≈ 0.44) and weakly with pH (r ≈ 0.10).
As alcohol percentage increases, the variance in quality increases up to the median value. In the plot of quality vs. alcohol, the quality of wine increases up to the median value of alcohol and starts decreasing after that. The relationship between alcohol and quality is not regular.
Based on the R^2 value (0.4356^2 ≈ 0.19), alcohol explains about 19 percent of the variance in quality. Other ingredients of interest can be incorporated into the model to explain more of the variance in quality.
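For a single predictor, R^2 is the square of the Pearson correlation, so the share of variance explained can be read off the `cor.test()` output. The sketch below squares the reported correlation and then checks the identity on simulated data (the simulated `x` and `y` are hypothetical and unrelated to the wine file):

```r
r <- 0.4355747   # Pearson correlation of alcohol and quality, from cor.test()
r_squared <- r^2
print(round(r_squared, 3))

# Check on simulated data: for a one-predictor lm(),
# R^2 equals the squared correlation between x and y
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
fit <- lm(y ~ x)
print(all.equal(cor(x, y)^2, summary(fit)$r.squared))
```

Squaring 0.4356 gives roughly 0.19, so alcohol on its own accounts for about a fifth of the variance in quality.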
Alcohol percentage and quality tend to correlate with each other. Alcohol percentage and pH are also weakly positively correlated (r ≈ 0.12): the higher the alcohol percentage, the slightly greater the pH tends to be.
The quality of a wine is positively correlated with alcohol (moderately) and with pH (weakly). free.sulfur.dioxide shows almost no linear correlation with quality (r ≈ 0.008, p = 0.57), and fixed.acidity also appears less strongly related to quality than alcohol. Because alcohol and pH are only weakly correlated with each other, they do not measure the same property, so both can be used together in a model to predict the quality of wine.
ggplot(aes(x=quality,y=free.sulfur.dioxide),
data=subset(pf,!is.na(alcohol_percentage)))+
geom_line(aes(color=alcohol_percentage),stat='summary',fun.y=median)
We can see that, at the lowest quality scores, the median free sulphur dioxide decreases as quality rises to 4. From there the median free sulphur dioxide increases with quality, and then decreases again at the higher quality scores.
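Each line in this kind of plot connects one median per quality score within a group. A minimal sketch of the underlying table, using base R's `aggregate()` on a made-up data frame (`toy` and its values are hypothetical; the column names follow `pf`):

```r
# Hypothetical toy data with the same column names as pf
toy <- data.frame(
  quality             = c(5, 5, 5, 6, 6, 6),
  free.sulfur.dioxide = c(20, 30, 47, 14, 30, 45),
  alcohol_percentage  = c("(8,10]", "(8,10]", "(10,12]",
                          "(8,10]", "(10,12]", "(10,12]")
)

# One median per (quality, alcohol group) pair: the points each line connects
line_points <- aggregate(free.sulfur.dioxide ~ quality + alcohol_percentage,
                         data = toy, FUN = median)
print(line_points)
```

Because quality is on the x axis, the lines describe the typical free sulfur dioxide among wines of a given quality score, not the effect of changing free sulfur dioxide on quality.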
ggplot(aes(x=quality,y=free.sulfur.dioxide),
data=subset(pf,!is.na(pH_group)))+
geom_line(aes(color=pH_group),stat='summary',fun.y=median)
Grouped by pH, the pattern is similar: at the lowest quality scores, the median free sulphur dioxide decreases as quality rises to 4. From there the median free sulphur dioxide increases with quality, and then decreases again at the higher quality scores.
ggplot(aes(x=quality,y=volatile.acidity),
data=subset(pf,!is.na(pH_group)))+
geom_line(aes(color=pH_group),stat='summary',fun.y=median)
names(pf)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "alcohol_percentage" "pH_group"
We can see that the median volatile acidity increases as quality rises to 4, except in the pH group (3.02,3.32]. From there the median volatile acidity decreases as quality increases, and then increases again at the higher quality scores.
### Quality vs. free sulphur dioxide and alcohol
library(RColorBrewer)
ggplot(aes(free.sulfur.dioxide,quality,color=alcohol_percentage),data=pf)+
geom_point(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='Alcohol Percentage',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality (cube root) by cube root of free sulphur dioxide and alcohol')
## Warning: Removed 19 rows containing missing values (geom_point).
The plot indicates that a linear model could be constructed to predict wine quality, using the cube root of quality as the outcome variable and the cube root of free sulphur dioxide as the predictor variable. We can see from the above plots that the quality of wine increases up to the median value of alcohol percentage. Increasing the alcohol percentage beyond its median value decreases wine quality.
ggplot(aes(free.sulfur.dioxide,quality,color=pH_group),data=pf)+
geom_point(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='pH group',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality (cube root) by cube root of free sulphur dioxide and pH group')
## Warning: Removed 18 rows containing missing values (geom_point).
ggplot(aes(volatile.acidity,quality,color=pH_group),data=pf)+
geom_point(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='pH group',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,2),
breaks = c(0.5,1,1.5,2))+
scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality (cube root) by cube root of volatile acidity and pH group')
## Warning: Removed 1 rows containing missing values (geom_point).
ggplot(aes(x=pH_group,y=quality),data = pf)+
geom_boxplot()
We can see that the quality of wine increases with alcohol up to the median alcohol value. Once alcohol increases beyond its median value, the quality of wine decreases.
ggplot(aes(x=alcohol_percentage,y=quality),data = pf)+
geom_boxplot()
Ideally, wines also fall in the average alcohol and pH groups. The variance of wine quality increases up to the median alcohol percentage and decreases after that.
The last two plots from the Multivariate section suggest that I can build a linear model and use those variables in the model to predict the quality of wine. The results of the model are summarized below.
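The notebook does not show the model code, so the following is only a sketch of how such a model could be fitted with `lm()`. It runs on simulated data (`sim`, its coefficients and its noise level are all hypothetical), with column names matching `pf` so that `data = pf` could be substituted on the real file:

```r
# Simulated stand-in for pf (hypothetical values; column names follow the source)
set.seed(1000)
n <- 300
sim <- data.frame(
  alcohol = runif(n, 8, 14.2),
  pH      = runif(n, 2.72, 3.82)
)
sim$quality <- 2 + 0.3 * sim$alcohol + 0.2 * sim$pH + rnorm(n, sd = 0.8)

# Nested linear models: alcohol alone, then alcohol plus pH
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, ~ . + pH)

print(coef(m2))
print(c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared))
```

Building the models up one predictor at a time shows how much each added variable improves R^2. Since `memisc` is already loaded, `mtable(m1, m2)` could be used to print the two fits side by side.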
Rise and fall of wine quality with alcohol. Wine quality increases as the alcohol percentage rises to its median value. Once the alcohol percentage increases beyond its median value, quality decreases.
ggplot(aes(x=pH,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
We can see that the quality of wine increases with pH up to the median pH value. Once pH increases beyond its median value, the quality of wine decreases.
ggplot(aes(x=alcohol_percentage,y=quality),data=pf)+
geom_boxplot()
ggplot(aes(x=quality,y=volatile.acidity),
data=subset(pf,!is.na(pH_group)))+
geom_line(aes(color=pH_group),stat='summary',fun.y=median)
## Description Two
We can see from the above two graphs that the quality of wine increases up to the median value of alcohol percentage. Increasing the alcohol percentage beyond its median value decreases wine quality. We can also see that the quality of wine increases with pH up to the median pH value. Once pH increases beyond its median value, the quality of wine decreases.
ggplot(aes(free.sulfur.dioxide,quality,color=alcohol_percentage),data=pf)+
geom_point(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='Alcohol Percentage',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality (cube root) by cube root of free sulphur dioxide and alcohol')
## Warning: Removed 19 rows containing missing values (geom_point).
The plot indicates that a linear model could be constructed to predict wine quality, using the cube root of quality as the outcome variable and the cube root of free sulphur dioxide as the predictor variable. We can see from the above plots that the quality of wine increases up to the median value of alcohol percentage. Increasing the alcohol percentage beyond its median value decreases wine quality.
The white wine data set contains information on 4,898 white wines with 11 variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables and considered how a linear model could predict wine quality.
This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The main features in the data set are alcohol and quality. I'd like to determine which ingredients are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a model that predicts wine quality.
(worst) ------> (best) Quality: 3, 4, 5, 6, 7, 8, 9. Quality is recorded as an integer score and is treated here as a numeric variable.
Some limitations of this analysis come from the source of the data. The data set contains only 4,898 wines, which is not very large, and it is a sample rather than the whole population, so predictions based on it might be wrong. To investigate further, I would like to gather much more data and train a model on it. I would like to analyze which factors best describe the quality of wine, and to see which combinations of ingredients customers like most.